Abstract

Several researchers have experimentally shown that substantial improvements can be ob- 
tained in difficult pattern recognition problems by combining or integrating the outputs of mul- 
tiple classifiers. This chapter provides an analytical framework to quantify the improvements in 
classification results due to combining. The results apply to both linear combiners and order
statistics combiners. We first show that to a first order approximation, the error rate obtained
over and above the Bayes error rate is directly proportional to the variance of the actual decision
boundaries around the Bayes optimum boundary. Combining classifiers in output space reduces
this variance, and hence reduces the “added” error. If N unbiased classifiers are combined by
simple averaging, the added error rate can be reduced by a factor of N if the individual errors in
approximating the decision boundaries are uncorrelated. Expressions are then derived for linear 
combiners which are biased or correlated, and the effect of output correlations on ensemble per- 
formance is quantified. For order statistics based non-linear combiners, we derive expressions 
that indicate how much the median, the maximum and in general the ith order statistic can 
improve classifier performance. The analysis presented here facilitates the understanding of 
the relationships among error rates, classifier boundary distributions, and combining in output 
space. Experimental results on several public domain data sets are provided to illustrate the 
benefits of combining and to support the analytical results. 


1 Introduction 

Training a parametric classifier involves the use of a training set of data with known labeling to 
estimate or “learn” the parameters of the chosen model. A test set, consisting of patterns not
previously seen by the classifier, is then used to determine the classification performance. This 
ability to meaningfully respond to novel patterns, or generalize, is an important aspect of a classifier 
system and in essence, the true gauge of performance [38, 77]. Given infinite training data, consistent 
classifiers approximate the Bayesian decision boundaries to arbitrary precision, therefore providing 
similar generalizations [24]. However, often only a limited portion of the pattern space is available or 
observable [16, 22]. Given a finite and noisy data set, different classifiers typically provide different
generalizations by realizing different decision boundaries [26]. For example, when classification is
performed using a multilayered, feed-forward artificial neural network, different weight initializations, 




Figure 1: Combining Strategy. The solid lines leading to $f^{ind}$ represent the decision of a specific
classifier, while the dashed lines lead to $f^{comb}$, the output of the combiner.


• relates the location of the decision boundary to the classifier error. 

The rest of this article is organized as follows. Section 2 introduces the overall framework for 
estimating error rates and the effects of combining. In Section 3 we analyze linear combiners, 
and derive expressions for the error rates for both biased and unbiased classifiers. In Section 4, 
we examine order statistics combiners, and analyze the resulting classifier boundaries and error 
regions. In Section 5 we study linear combiners that make correlated errors, derive their error 
reduction rates, and discuss how to use this information to build better combiners. In Section 6, we
present experimental results based on real world problems, and we conclude with a discussion of the 
implications of the work presented in this article. 


2 Class Boundary Analysis and Error Regions 

Consider a single classifier whose outputs are expected to approximate the corresponding a posteriori 
class probabilities if it is reasonably well trained. The decision boundaries obtained by such a 
classifier are thus expected to be close to Bayesian decision boundaries. Moreover, these boundaries 
will tend to occur in regions where the number of training samples belonging to the two most locally 
dominant classes (say, classes i and j) are comparable.

We will focus our analysis on network performance around the decision boundaries. Consider the 
boundary between classes i and j for a single-dimensional input (the extension to multi-dimensional 
inputs is discussed in [73]). First, let us express the output response of the ith unit of a one-of-L 
where $p'_k(\cdot)$ denotes the derivative of $p_k(\cdot)$. With this substitution, Equation 2 becomes:

$$ p_i(x^*) + b \, p'_i(x^*) + \epsilon_i(x_b) = p_j(x^*) + b \, p'_j(x^*) + \epsilon_j(x_b). $$

Now, since $p_i(x^*) = p_j(x^*)$, we get:

$$ b \, (p'_j(x^*) - p'_i(x^*)) = \epsilon_i(x_b) - \epsilon_j(x_b). $$

Finally we obtain:

$$ b = \frac{\epsilon_i(x_b) - \epsilon_j(x_b)}{s}, \qquad (5) $$

where:

$$ s = p'_j(x^*) - p'_i(x^*). \qquad (6) $$

Let the error $\epsilon_i(x_b)$ be broken into a bias and noise term ($\epsilon_i(x_b) = \beta_i + \eta_i(x_b)$). Note that the
terms “bias” and “noise” are only analogies, since the error is due to the classifier as well as the data.
For the time being, the bias is assumed to be zero (i.e. $\beta_k = 0 \; \forall k$). The case with nonzero bias will
be discussed at the end of this section. Let $\sigma^2_{\eta_k}$ denote the variances of $\eta_k(x)$, which are taken to be
i.i.d. variables³. Then, the variance of the zero-mean variable b is given by (using Equation 5):

$$ \sigma_b^2 = \frac{2 \sigma^2_{\eta_k}}{s^2}. \qquad (7) $$


Figure 2 shows the a posteriori probabilities obtained by a non-ideal classifier, and the associated 
added error region. The lightly shaded area provides the Bayesian error region. The darkly shaded 
area is the added error region associated with selecting a decision boundary that is offset by b,
since patterns corresponding to the darkly shaded region are erroneously assigned to class i by the 
classifier, although ideally they should be assigned to class j. 

The added error region, denoted by $A(b)$, is given by:

$$ A(b) = \int_{x^*}^{x^*+b} (p_j(x) - p_i(x)) \, dx. \qquad (8) $$

Based on this area, the expected added error, $E_{add}$, is given by:

$$ E_{add} = \int_{-\infty}^{\infty} A(b) f_b(b) \, db, \qquad (9) $$

where $f_b$ is the density function for b. More explicitly, the expected added error is:

$$ E_{add} = \int_{-\infty}^{\infty} \int_{x^*}^{x^*+b} (p_j(x) - p_i(x)) f_b(b) \, dx \, db. $$
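
As an illustration of Equations 8 and 9, the following minimal sketch evaluates the added error numerically. It assumes the linear approximation of the posteriors around the ideal boundary, so that the integrand grows linearly with slope s and the area becomes s b^2 / 2, and it assumes a zero-mean Gaussian density for the offset b. The function names, the Gaussian choice, and the numerical values are illustrative assumptions, not part of the original analysis.

```python
import numpy as np


def added_error_area(b, s):
    """Area A(b) when p_j(x) - p_i(x) is approximated by s * (x - x*).

    Integrating s * u for u from 0 to b gives s * b**2 / 2.
    """
    return 0.5 * s * b ** 2


def expected_added_error(s, sigma_b, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E_add = E[A(b)].

    Assumption: b is drawn from a zero-mean Gaussian with std sigma_b.
    """
    rng = np.random.default_rng(seed)
    b = rng.normal(0.0, sigma_b, size=n_samples)
    return added_error_area(b, s).mean()


s, sigma_b = 2.0, 0.1            # hypothetical slope difference and offset spread
estimate = expected_added_error(s, sigma_b)
closed_form = 0.5 * s * sigma_b ** 2   # s * sigma_b^2 / 2 for a zero-mean offset
print(estimate, closed_form)
```

Under these assumptions the expected added error depends on the offset only through its variance, which is why reducing the boundary variance reduces the added error.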

One can compute $A(b)$ directly by using the approximation in Equation 3 and solving Equation 8.
The accuracy of this approximation depends on the proximity of the boundary to the ideal boundary. 
However, since in general, the boundary density decreases rapidly with increasing distance from the 

³ Each output of each network does approximate a smooth function, and therefore the noise for two nearby patterns
on the same class (i.e. $\eta_k(x)$ and $\eta_k(x + \Delta x)$) is correlated. The independence assumption applies to inter-class noise
(i.e. $\eta_i(x)$ and $\eta_j(x)$), not intra-class noise.



3 Linear Combining 

3.1 Linear Combining of Unbiased Classifiers 

Let us now divert our attention to the effects of linearly combining multiple classifiers. In what
follows, the combiner denoted by ave performs an arithmetic average in output space. If N classifiers
are available, the ith output of the ave combiner provides an approximation to $p_i(x)$ given by:

$$ f_i^{ave}(x) = \frac{1}{N} \sum_{m=1}^{N} f_i^m(x), \qquad (16) $$

or:

$$ f_i^{ave}(x) = p_i(x) + \bar{\beta}_i + \bar{\eta}_i(x), $$

where:

$$ \bar{\eta}_i(x) = \frac{1}{N} \sum_{m=1}^{N} \eta_i^m(x) $$

and

$$ \bar{\beta}_i = \frac{1}{N} \sum_{m=1}^{N} \beta_i^m. $$

If the classifiers are unbiased, $\bar{\beta}_i = 0$. Moreover, if the errors of different classifiers are i.i.d., the
variance of $\bar{\eta}_i$ is given by:

$$ \sigma^2_{\bar{\eta}_i} = \frac{1}{N^2} \sum_{m=1}^{N} \sigma^2_{\eta_i^m} = \frac{1}{N} \sigma^2_{\eta_i}. \qquad (17) $$

The boundary $x^{ave}$ then has an offset $b^{ave}$, where:

$$ f_i^{ave}(x^* + b^{ave}) = f_j^{ave}(x^* + b^{ave}), $$

and:

$$ b^{ave} = \frac{\bar{\eta}_i(x_{b^{ave}}) - \bar{\eta}_j(x_{b^{ave}})}{s}. $$

The variance of $b^{ave}$, $\sigma^2_{b^{ave}}$, can be computed in a manner similar to $\sigma_b^2$, resulting in:

$$ \sigma^2_{b^{ave}} = \frac{\sigma^2_{\bar{\eta}_i} + \sigma^2_{\bar{\eta}_j}}{s^2}, $$

which, using Equation 17, leads to:
Equation 24 quantifies the error reduction in the presence of network bias. The improvements are
more modest than those of the previous section, since both the bias and the variance of the noise need
to be reduced. If both the variance and the bias contribute to the error, and their contributions are of
similar magnitude, the actual reduction is given by $\min(z^2, N)$. If the bias can be kept low (e.g. by
purposefully using a larger network than required), then once again N becomes the reduction factor.
These results highlight the basic strengths of combining, which not only provides improved error
rates, but is also a method of controlling the bias and variance components of the error separately,
thus providing an interesting solution to the bias/variance problem [24].
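
The different behavior of the bias and variance components under averaging can be checked with a small simulation. The sketch below is only illustrative: it assumes Gaussian noise that is i.i.d. across classifiers and, as an extreme case, a bias that is identical in every classifier (so averaging cannot reduce it). The function name and parameter values are hypothetical.

```python
import numpy as np


def simulate_average(n_classifiers, bias, noise_std, n_trials=100_000, seed=0):
    """Return (mean, variance) of the averaged error of n_classifiers.

    Assumptions: every classifier has the same bias, and the noise terms
    are zero-mean Gaussian and independent across classifiers.
    """
    rng = np.random.default_rng(seed)
    errors = bias + rng.normal(0.0, noise_std, size=(n_trials, n_classifiers))
    averaged = errors.mean(axis=1)        # ave combiner applied to the errors
    return averaged.mean(), averaged.var()


mean_1, var_1 = simulate_average(1, bias=0.3, noise_std=1.0)
mean_10, var_10 = simulate_average(10, bias=0.3, noise_std=1.0)
print(mean_1, var_1)      # bias close to 0.3, variance close to 1.0
print(mean_10, var_10)    # bias still close to 0.3, variance close to 0.1
```

In this extreme case only the variance shrinks with N, which is consistent with the remark that the bias should be kept low before combining.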


4 Order Statistics 

4.1 Introduction 

Approaches to pooling classifiers can be separated into two main categories: simple combiners, 
e.g., averaging, and computationally expensive combiners, e.g., stacking. The simple combining 
methods are best suited for problems where the individual classifiers perform the same task, and 
have comparable success. However, such combiners are susceptible to outliers and to unevenly 
performing classifiers. In the second category, “meta-learners,” i.e., either sets of combining rules,
or full fledged classifiers acting on the outputs of the individual classifiers, are constructed. This type
of combining is more general, but suffers from all the problems associated with the extra learning
(e.g., overparameterizing, lengthy training time).

Both these methods are in fact ill-suited for problems where most (but not all) classifiers perform 
within a well-specified range. In such cases the simplicity of averaging the classifier outputs is 
appealing, but the prospect of one poor classifier corrupting the combiner makes this a risky choice. 
Although weighted averaging of classifier outputs appears to provide some flexibility, obtaining the
optimal weights can be computationally expensive. Furthermore, the weights are generally assigned
on a per classifier, rather than per sample or per class basis. If a classifier is accurate only in certain
areas of the input space, this scheme fails to take advantage of the variable accuracy of the classifier
in question. Using a meta learner that would have weights for each classifier on each pattern would
solve this problem, but at a considerable cost. The robust combiners presented in this section aim 
at bridging the gap between simplicity and generality by allowing the flexible selection of classifiers 
without the associated cost of training meta classifiers. 


4.2 Background 

In this section we will briefly discuss some basic concepts and properties of order statistics. Let X
be a random variable with a probability density function $f_X(\cdot)$, and cumulative distribution function
$F_X(\cdot)$. Let $(X_1, X_2, \ldots, X_N)$ be a random sample drawn from this distribution. Now, let us arrange
them in non-decreasing order, providing:

$$ X_{1:N} \le X_{2:N} \le \cdots \le X_{N:N}. $$
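
The ordering above can be illustrated with a short sketch. The sample values are hypothetical and chosen only to show how the minimum, median and maximum order statistics are read off the sorted sample.

```python
import numpy as np


def order_statistics(sample):
    """Return the sample arranged in non-decreasing order."""
    return np.sort(np.asarray(sample, dtype=float))


sample = [0.7, 0.2, 0.9, 0.4, 0.5]         # hypothetical draws, N = 5
ordered = order_statistics(sample)

minimum = ordered[0]                        # first order statistic
median = ordered[len(ordered) // 2]         # middle order statistic (N odd)
maximum = ordered[-1]                       # N-th order statistic
print(ordered, minimum, median, maximum)
```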
on the other hand considers the most “typical” representation of each class. For highly noisy data,
this combiner is more desirable than either the min or max combiners since the decision is not
compromised as much by a single large error.

The analysis of the properties of these combiners does not depend on the order statistic chosen.
Therefore we will denote all three by $f_i^{os}(x)$ and derive the error regions. The network output
provided by $f_i^{os}(x)$ is given by:

$$ f_i^{os}(x) = p_i(x) + \epsilon_i^{os}(x). \qquad (29) $$

Let us first investigate the zero-bias case ($\beta_k = 0 \; \forall k$). We get $\epsilon_k^{os}(x) = \eta_k^{os}(x) \; \forall k$, since the
variations in the kth output of the classifiers are solely due to noise. Proceeding as before, the
boundary $b^{os}$ is shown to be:

$$ b^{os} = \frac{\eta_i^{os}(x_b) - \eta_j^{os}(x_b)}{s}. \qquad (30) $$

s 

Since the $\eta_k$'s are i.i.d. and $\eta_k^{os}$ is the same order statistic for each class, the moments will be identical
for each class. Moreover, taking the order statistic will shift the mean of both $\eta_i^{os}$ and $\eta_j^{os}$ by the
same amount, leaving the mean of the difference unaffected. Therefore, $b^{os}$ will have zero mean, and
variance:

$$ \sigma^2_{b^{os}} = \frac{2 \sigma^2_{\eta_k^{os}}}{s^2} = \frac{2 \alpha \sigma^2_{\eta_k}}{s^2} = \alpha \sigma_b^2, \qquad (31) $$

where $\alpha$ is a reduction factor that depends on the order statistic and on the distribution of b. For
most distributions, $\alpha$ can be found in tabulated form [3]. For example, Table 1 provides $\alpha$ values
for all three os combiners, up to 15 classifiers, for a Gaussian distribution [3, 58].

Returning to the error calculation, we have: $M_1^{os} = 0$, and $M_2^{os} = \sigma^2_{b^{os}}$, providing:
$$ E_{add}^{os} = \frac{s M_2^{os}}{2} = \frac{s \sigma^2_{b^{os}}}{2} = \alpha E_{add}. \qquad (32) $$


Equation 32 shows that the reduction in the error region is directly related to the reduction in 
the variance of the boundary offset b. Since the means and variances of order statistics for a variety 
of distributions are widely available in tabular form, the reductions can be readily quantified. 
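
Where tabulated values are not at hand, the reduction factor of Equation 31 (written alpha below) can be approximated by simulation. The following sketch assumes standard Gaussian noise that is i.i.d. across classifiers and estimates the variance of the chosen order statistic relative to the unit variance of a single classifier. The function name is hypothetical, and for an even number of classifiers `np.median` averages the two middle values, which is a choice made here rather than something specified in the text.

```python
import numpy as np


def alpha_gaussian(n_classifiers, statistic, n_trials=200_000, seed=0):
    """Monte Carlo estimate of the variance reduction factor alpha.

    Assumption: each classifier's noise is an independent standard normal,
    so the single-classifier variance is 1 and alpha equals the variance
    of the selected order statistic.
    """
    reducers = {"min": np.min, "med": np.median, "max": np.max}
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((n_trials, n_classifiers))
    combined = reducers[statistic](noise, axis=1)
    return combined.var()


for n in (1, 3, 5, 7):
    print(n, [round(alpha_gaussian(n, name), 3) for name in ("min", "med", "max")])
```

For two Gaussian classifiers the variance of the maximum is known in closed form, 1 - 1/pi, which gives a quick sanity check on the estimate.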


4.4 Combining Biased Classifiers through OS 


In this section, we analyze the error regions in the presence of bias. Let us study $b^{os}$ in detail when
multiple classifiers are combined using order statistics. First note that the bias and noise cannot be
separated, since in general $(a + b)^{os} \ne a^{os} + b^{os}$. We will therefore need to specify the mean and
variance of the result of each operation⁶. Equation 30 becomes:

$$ b^{os} = \frac{(\beta_i + \eta_i(x_b))^{os} - (\beta_j + \eta_j(x_b))^{os}}{s}. \qquad (33) $$

Now, $\beta_k$ has mean $\bar{\beta}_k$, given by $\frac{1}{N} \sum_{m=1}^{N} \beta_k^m$, where m denotes the different classifiers. Since
the noise is zero-mean, $\beta_k + \eta_k(x_b)$ has first moment $\bar{\beta}_k$ and variance $\sigma^2_{\eta_k} + \sigma^2_{\beta_k}$, where
$\sigma^2_{\beta_k} = \frac{1}{N-1} \sum_{m=1}^{N} (\beta_k^m - \bar{\beta}_k)^2$.

⁶ Since the exact distribution parameters of $b^{os}$ are not known, we use the sample mean and the sample variance.





(39) 


We get:

— Mr - ('V/Tj 4- 3" — (i,i~) 

Analyzing the error reduction in the general case requires knowledge about the bias introduced by
each classifier. However, it is possible to analyze the extreme cases. If each classifier has the same
bias for example, $\sigma^2_\beta$ is reduced to zero and $\bar{\beta} = \beta$. In this case the error reduction can be expressed
as:

$$ E_{add}^{os}(\beta) = \frac{s}{2} (\alpha \sigma_b^2 + \beta^2), $$

where only the error contribution due to the variance of b is reduced. In this case it is important to
reduce classifier bias before combining (e.g. by using an overparametrized model). If on the other
hand, the biases produce a zero mean variable, i.e. they cancel each other out, we obtain $\bar{\beta} = 0$. In
this case, the added error becomes:

$$ E_{add}^{os}(\beta) = \alpha E_{add}(\beta) + \frac{s \alpha}{2} (\sigma^2_\beta - \beta^2), $$

and the error reduction will be significant as long as $\sigma^2_\beta \le \beta^2$.


5 Correlated Classifier Combining 

5.1 Introduction

The discussion so far focused on finding the types of combiners that improve performance. Yet, 
it is important to note that if the classifiers to be combined repeatedly provide the same (either 
erroneous or correct) classification decisions, there is little to be gained from combining, regardless 
of the chosen scheme. Therefore, the selection and training of the classifiers that will be combined 
is as critical an issue as the selection of the combining method. Indeed, classifier/data selection is 
directly tied to the amount of correlation among the various classifiers, which in turn affects the 
amount of error reduction that can be achieved. 

The tie between error correlation and classifier performance was directly or indirectly observed by 
many researchers. For regression problems, Perrone and Cooper show that their combining results 
are weakened if the networks are not independent [49]. Ali and Pazzani discuss the relationship 
between error correlations and error reductions in the context of decision trees [2]. Meir discusses the 
effect of independence on combiner performance [41], and Jacobs reports that $N' \le N$ independent
classifiers are worth as much as N dependent classifiers [34]. The influence of the amount of training 
on ensemble performance is studied in [64]. For classification problems, the effect of the correlation 
among the classifier errors on combiner performance was quantified by the authors [70]. 


5.2 Combining Unbiased Correlated Classifiers 

In this section we derive the explicit relationship between the correlation among classifier errors and 
the error reduction due to combining. Let us focus on the linear combination of unbiased classifiers. 
Without the independence assumption, the variance of $\bar{\eta}_i$ is given by:

$$ \sigma^2_{\bar{\eta}_i} = \frac{1}{N^2} \sum_{l=1}^{N} \sum_{m=1}^{N} cov(\eta_i^m(x), \eta_i^l(x)) $$
This expression only considers the errors that occur between classes i and j. In order to extend this
expression to include all the boundaries, we introduce an overall correlation term $\delta$. Then, the added
error is computed in terms of $\delta$. The correlation among classifiers is calculated using the following
expression:

$$ \delta = \sum_{i=1}^{L} P_i \delta_i, \qquad (42) $$

where $P_i$ is the prior probability of class i. The correlation contribution of each class to the overall
correlation is proportional to the prior probability of that class.
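
The prior-weighted sum of Equation 42 is simple to compute once per-class correlation values are available. The sketch below takes those per-class values as given inputs; how they are estimated is not shown here, and the function name and the example numbers are hypothetical.

```python
import numpy as np


def overall_correlation(class_priors, class_correlations):
    """Prior-weighted overall correlation: sum over classes of P_i * delta_i."""
    priors = np.asarray(class_priors, dtype=float)
    deltas = np.asarray(class_correlations, dtype=float)
    if priors.shape != deltas.shape:
        raise ValueError("need one correlation value per class")
    if not np.isclose(priors.sum(), 1.0):
        raise ValueError("class priors must sum to 1")
    return float(np.dot(priors, deltas))


# Hypothetical three-class problem.
delta = overall_correlation([0.5, 0.3, 0.2], [0.4, 0.6, 0.9])
print(delta)
```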


Figure 3: Error reduction for different classifier error correlations. The vertical axis shows Err(ave)/Err; the curves correspond to values of $\delta$ between 0.2 and 1.0.


Let us now return to the error region analysis. With this formulation the first and second moments
of $b^{ave}$ yield: $M_1^{ave} = 0$, and $M_2^{ave} = \sigma^2_{b^{ave}}$. The derivation is identical to that of Section 3.1 and
the only change is in the relation between $\sigma_b^2$ and $\sigma^2_{b^{ave}}$. We then get:

$$ E_{add}^{ave} = \frac{s M_2^{ave}}{2} = \frac{s}{2} \sigma^2_{b^{ave}} = \frac{s}{2} \sigma_b^2 \left( \frac{1 + \delta (N-1)}{N} \right) = E_{add} \left( \frac{1 + \delta (N-1)}{N} \right). \qquad (43) $$


The effect of the correlation between the errors of each classifier is readily apparent from Equation 43.
If the errors are independent, then the second part of the reduction term vanishes and the
combined error is reduced by N. If on the other hand, the error of each classifier has correlation
1, then the error of the combiner is equal to the initial errors and there is no improvement due to
combining. Figure 3 shows how the variance reduction is affected by N and $\delta$ (using Equation 43).
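
The reduction term of Equation 43 can be evaluated directly and compared against a simulation. The sketch below assumes unit-variance Gaussian errors with the same pairwise correlation between every pair of classifiers, built from one shared and one private noise component; that construction and the function names are assumptions made for illustration.

```python
import numpy as np


def reduction_factor(n_classifiers, delta):
    """Factor (1 + delta * (N - 1)) / N multiplying the added error."""
    return (1.0 + delta * (n_classifiers - 1)) / n_classifiers


def simulated_factor(n_classifiers, delta, n_trials=200_000, seed=0):
    """Variance of the averaged error for equicorrelated unit-variance errors.

    Assumption: each error is sqrt(delta) * shared + sqrt(1 - delta) * private,
    which gives unit variance and pairwise correlation delta.
    """
    rng = np.random.default_rng(seed)
    shared = rng.standard_normal((n_trials, 1))
    private = rng.standard_normal((n_trials, n_classifiers))
    errors = np.sqrt(delta) * shared + np.sqrt(1.0 - delta) * private
    return errors.mean(axis=1).var()


for delta in (0.0, 0.5, 1.0):
    print(delta, reduction_factor(5, delta), round(simulated_factor(5, delta), 3))
```

With independent errors the factor is 1/N, and with fully correlated errors it is 1, matching the two limiting cases described above.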
Equation 19 shows the error reduction for correlated, biased classifiers. As long as the biases of
individual classifiers are reduced by a larger amount than the correlated variances, the reduction
will be similar to those in Section 5.2. However, if the biases are not reduced, the improvement gains
will not be as significant. These results are conceptually identical to those obtained in Section 3,
but vary in how the bias reduction z relates to N. In effect, the requirements on reducing $\beta$ are
lower than they were previously, since in the presence of bias, the error reduction is less than $1/N$.
The practical implication of this observation is that, even in the presence of bias, the correlation
dependent variance reduction term (given in Equation 43) will often be the limiting factor, and
dictate the error reductions.

5.4 Discussion 

In this section we established the importance of the correlation among the errors of individual clas- 
sifiers in a combiner system. One can exploit this relationship explicitly by reducing the correlation 
among classifiers that will be combined. Several methods have been proposed for this purpose and 
many researchers are actively exploring this area [60]. 

Cross-validation, a statistical method aimed at estimating the “true” error [21, 65, 75], can
also be used to control the amount of correlation among classifiers. By only training individual
classifiers on overlapping subsets of the data, the correlation can be reduced. The various boosting
algorithms exploit the relationship between correlation and error rate by training subsequent classifiers
on training patterns that have been “selected” by earlier classifiers [15, 13, 19, 59], thus reducing the
correlation among them. Krogh and Vedelsby discuss how cross-validation can be used to improve
ensemble performance [36]. Bootstrapping, or generating different training sets for each classifier by
resampling the original set [17, 18, 35, 75], provides another method for correlation reduction [47].
Breiman also addresses this issue, and discusses methods aimed at reducing the correlation among
estimators [9, 10]. Twomey and Smith discuss combining and resampling in the context of a 1-d
regression problem [74]. The use of principal component regression to handle multi-collinearity while
combining outputs of multiple regressors was suggested in [42]. Another approach to reducing the
correlation of classifiers can be found in input decimation, or in purposefully withholding some parts
of each pattern from a given classifier [70]. Modifying the training of individual classifiers in order
to obtain less correlated classifiers was also explored [56], and the selection of individual classifiers
through a genetic algorithm is suggested in [46].

In theory, reducing the correlation among classifiers that are combined increases the ensemble 
classification rates. In practice however, since each classifier uses a subset of the training data, 
individual classifier performance can deteriorate, thus offsetting any potential gains at the ensemble 
level [70]. It is therefore crucial to reduce the correlations without increasing the individual classifiers’ 
error rates. 
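
Of the correlation reduction methods mentioned in this section, bootstrapping is the simplest to sketch. The fragment below only generates the resampled index sets, one per classifier, by drawing with replacement from the original training set; the function name and sizes are hypothetical, and training the classifiers themselves is left out.

```python
import numpy as np


def bootstrap_indices(n_patterns, n_classifiers, seed=0):
    """One index array per classifier, sampled with replacement."""
    rng = np.random.default_rng(seed)
    return [rng.integers(0, n_patterns, size=n_patterns)
            for _ in range(n_classifiers)]


index_sets = bootstrap_indices(n_patterns=100, n_classifiers=3)
for indices in index_sets:
    # Fraction of distinct original patterns seen by this classifier.
    print(len(np.unique(indices)) / 100)
```

Each classifier sees only part of the distinct patterns, which illustrates the trade-off noted above: lower correlation among classifiers, but less data available to each one.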


6 Experimental Combining Results 

In order to provide in depth analysis and to demonstrate the result on public domain data sets, we 
have divided this section into two parts. First we will provide detailed experimental results on one 
difficult dataset, outlining all the relevant design steps/parameters. Then we will summarize results 
on several public domain data sets taken from the UCI depository/Proben1 benchmarks [50].
Table 3: Combining Results for FS1.

Classifier(s)          Ave           Med           Max           Min
               N    Error      σ  Error      σ  Error      σ  Error      σ
MLP            3        ?      ?   7.25   0.21   7.38   0.37   7.19   0.37
               5        ?      ?   7.30   0.29   7.32   0.41   7.20   0.37
               7        ?      ?   7.27   0.29   7.27   0.37   7.35   0.30
RBF            3     6.15   0.30   6.42   0.29   6.22   0.34   6.30   0.40
               5     6.05   0.20   6.23   0.18   6.12   0.34   6.06   0.39
               7     5.97   0.22   6.25   0.20   6.03   0.35   5.92   0.31
BOTH           3        ?      ?      ?   0.33   6.48   0.43   6.39   0.29
               5        ?      ?      ?   0.29   6.59   0.40   6.89   0.24
               7     6.08      ?   5.67   0.27   6.68      ?      ?   0.26

Table 4: Combining Results for FS2.

Classifier(s)          Ave           Med           Max           Min
               N    Error      σ  Error      σ  Error      σ  Error      σ
MLP            3     9.32      ?   9.47      ?   9.64   0.47   9.39   0.34
               5     9.20      ?   9.22      ?   9.73   0.44   9.27   0.30
               7     9.07      ?   9.11      ?   9.30   0.48   9.25   0.36
RBF            3        ?   0.45  10.49   0.42  10.59   0.57  10.74   0.34
               5        ?   0.30  10.51   0.34  10.55   0.40  10.65   0.37
               7        ?   0.32  10.46   0.31  10.58   0.43  10.66   0.39
BOTH           3        ?   0.57   9.20      ?   8.65   0.47   9.56   0.53
               5        ?   0.41   8.97      ?   8.71   0.36   9.50   0.45
               7     3.14   0.28   3.35      ?   8.79   0.40   9.40   0.39


• different classifiers trained with a single feature set (fifth and sixth rows); 

• single classifier trained on two different feature sets (seventh and eighth rows). 

There is a striking similarity between these correlation results and the improvements obtained 
through combining. When different runs of a single classifier are combined using only one feature 
set, the combining improvements are very modest. These are also the cases where the classifier 
correlation coefficients are the highest. Mixing different classifiers reduces the correlation, and in 
most cases, improves the combining results. The most drastic improvements are obtained when 
two qualitatively different feature sets are used, which are also the cases with the lowest classifier 
correlations. 


6.2 Proben1 Benchmarks

In this section, examples from the Proben1 benchmark set⁹ are used to study the benefits of combining [50].
Table 7 shows the test set error rate for both the MLP and the RBF classifiers on six
different data sets taken from the Proben1 benchmarks¹⁰.

⁹ Available from: ftp://ftp.ira.uka.de/pub/papers/techreports/1994/1994-21.ps.Z.

¹⁰ These Proben1 results correspond to the “pivot” and “no-shortcut” architectures (A and B respectively), reported
in [50]. The large error in the Proben1 no-shortcut architecture for the SOYBEAN1 problem is not explained.
Table 8: Combining Results for CANCER1.

Classifier(s)          Ave           Med           Max           Min
               N    Error      σ  Error      σ  Error      σ  Error      σ
MLP            3     0.60   0.13   0.63   0.17   0.66   0.21   0.66   0.21
               5     0.60   0.13   0.58   0.00   0.63   0.17   0.63   0.17
               7     0.60   0.13   0.58   0.00   0.60   0.13   0.60   0.13
RBF            3     1.29   0.48   1.12   0.53   1.90   0.52   0.95   0.42
               5     1.26   0.47   1.12   0.47   1.31   0.53   0.98   0.37
               7     1.32   0.41   1.18   0.43   1.31   0.53   0.39   0.34
BOTH           3     0.86   0.39   0.63   0.13   1.03   0.53   0.95   0.42
               5     0.72   0.25   0.72   0.25   1.38   0.43   0.33   0.29
               7     0.86   0.39   0.58   0.00   1.49   0.39   0.83   0.34


Table 9: Combining Results for CARD1.

Classifier(s)          Ave           Med           Max           Min
               N    Error      σ  Error      σ  Error      σ  Error      σ
MLP            3    13.37   0.45  13.61   0.56  13.43   0.44  13.40   0.47
               5    13.23   0.36  13.40   0.39  13.37   0.45  13.31   0.40
               7    13.20   0.26  13.29   0.33  13.26   0.35  13.20   0.32
RBF            3    13.40   0.70  13.58   0.76  14.01   0.66  13.08   1.05
               5    13.11   0.60  13.29   0.67  13.95   0.66  12.88   0.98
               7    13.02   0.33  12.99   0.33  13.75   0.76  12.82   0.67
BOTH           3    13.75   0.69  13.69   0.70  13.49   0.62  13.66   0.70
               5    13.78   0.55  13.66   0.67  13.66   0.65  13.75   0.64
               7    13.34   0.51  13.52   0.58  13.66   0.60  13.72   0.70


The CARD1 data set consists of credit approval decisions [51, 52]. 51 inputs are used to determine
whether or not to approve the credit card application of a customer. There are 690 examples in 
this set, and 345 are used for training. The MLP has one hidden layer with 20 units, and the RBF 
network has 20 kernels. 

The DIABETES1 data set is based on personal data of the Pima Indians obtained from the
National Institute of Diabetes and Digestive and Kidney Diseases [63]. The binary output determines 
whether or not the subjects show signs of diabetes according to the World Health Organization. 
The input consists of 8 attributes, and there are 768 examples in this set, half of which are used for 
training. MLPs with one hidden layer with 10 units, and RBF networks with 10 kernels are selected 
for this data set. 

The GENE1 data set is based on intron/exon boundary detection, or the detection of splice junctions
in DNA sequences [45, 66]. 120 inputs are used to determine whether a DNA section is a donor, 
an acceptor or neither. There are 3175 examples, of which 1588 are used for training. The MLP 
architecture consists of a single hidden layer network with 20 hidden units. The RBF network has 
10 kernels. 

The GLASS1 data set is based on the chemical analysis of glass splinters. The 9 inputs are used
to classify 6 different types of glass. There are 214 examples in this set, and 107 of them are used 
for training. MLPs with a single hidden layer of 15 units, and RBF networks with 20 kernels are 
Classifier(s)          Ave           Med           Max           Min
               N    Error      σ  Error      σ  Error      σ  Error      σ
MLP            3    32.07   0.00  32.07   0.00  32.07   0.00  32.07   0.00
               5    32.07   0.00  32.07   0.00  32.07   0.00  32.07   0.00
               7    32.07   0.00  32.07      ?  32.07   0.00  32.07   0.00
RBF            3    29.31   2.23  30.76   2.74  30.28   2.02  29.43   2.89
               5    29.23   1.34  30.19   1.69  30.35   2.00  28.30   2.46
               7    29.06   1.51  30.00   1.38  31.89   1.78  27.55   1.33
BOTH           3    30.66   2.52  29.06   2.02  33.37   1.74  29.91   2.25
               5    32.36   1.82  23.30   1.46  33.68   1.32  29.72   1.78
               7    32.45   0.96  27.93   1.75  34.15   1.68  29.91   1.61

Table 13: Combining Results for SOYBEAN1.

Classifier(s)          Ave           Med           Max           Min
               N    Error      σ  Error      σ  Error      σ  Error      σ
MLP            3     7.06      ?      ?   0.13   7.06   0.00   7.85   1.42
               5     7.06      ?      ?   0.00   7.06   0.00   8.38   1.63
               7     7.06      ?   7.06   0.00   7.06   0.00   8.88   1.68
RBF            3     7.74      ?      ?   0.42   7.35   0.47   7.77      ?
               5     7.62   0.23      ?   0.30   7.77   0.30   7.65      ?
               7        ?   0.23      ?   0.33   7.68   0.29   7.59   0.45
BOTH           3     7.18   0.23   7.12      ?   7.56   0.28   7.85   1.27
               5     7.18   0.23   7.12   0.17      ?   0.25      ?   1.22
               7     7.18   0.24   7.13   0.23   7.50   0.25   8.09   1.05


in most cases. If the combined bias is not lowered, the combiner will not outperform the better 
classifier. Second, as discussed in Section 5.2, the correlation plays a major role in the final reduction
factor. There are no guarantees that using different types of classifiers will reduce the correlation 
factors. Therefore, the combining of different types of classifiers, especially when their respective 
performances are significantly different (the error rate for the RBF network on the CANCER1 data 
set is over twice the error rate for MLPs) has to be treated with caution. 

Determining which combiner (e.g. ave or med), or which classifier selection (e.g. multiple MLPs
or MLPs and RBFs) will perform best in a given situation is not generally an easy task. However, 
some information can be extracted from the experimental results. The linear combiner, for example, 
appears more compatible with the MLP classifiers than with the RBF networks. When combining 
two types of network, the med combiner often performs better than other combiners. One reason for 
this is that the outputs that will be combined come from different sources, and selecting the largest 
or smallest value can favor one type of network over another. These results emphasize the need for 
closely coupling the problem at hand with a classifier/combiner. There does not seem to be a single 
type of network or combiner that can be labeled “best” under all circumstances. 
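
To make the comparison between combiners concrete, the following sketch applies the ave, med, max and min combiners in output space and then picks the class with the largest combined output. The array layout, the function name, and the toy outputs are assumptions made for illustration; they are not taken from the experiments reported here.

```python
import numpy as np

# Output-space combiners, applied across the classifier axis.
COMBINERS = {"ave": np.mean, "med": np.median, "max": np.max, "min": np.min}


def combine(outputs, combiner):
    """Return the predicted class index for each pattern.

    outputs has shape (n_classifiers, n_patterns, n_classes) and holds each
    classifier's estimate of the class posteriors.
    """
    stacked = np.asarray(outputs, dtype=float)
    combined = COMBINERS[combiner](stacked, axis=0)
    return combined.argmax(axis=1)


# Three hypothetical classifiers, two patterns, two classes.
# The third classifier disagrees strongly on the first pattern.
outputs = [
    [[0.6, 0.4], [0.30, 0.70]],
    [[0.7, 0.3], [0.40, 0.60]],
    [[0.1, 0.9], [0.45, 0.55]],
]
for name in ("ave", "med", "max", "min"):
    print(name, combine(outputs, name).tolist())
```

In this toy case the single dissenting classifier changes the ave, max and min decisions on the first pattern, while the med combiner follows the majority, which is the kind of robustness to one large error discussed for order statistics combiners.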

























overtraining, but not undertraining (except in cases where the undertraining is very mild). This
corroborates well with the theoretical framework which shows combining to be more effective at
variance reduction than bias reduction.

The classification rates obtained by the order statistics combiners in Section 6 are in general
comparable to those obtained by averaging. The advantage of OS approaches should be more evident
in situations where there is substantial variability in the performance of individual classifiers, and
thus the robust properties of OS combining can be brought to bear. Such variability in individual
performance may be due to, for example, the classifiers being geographically distributed and working
only on locally available data of highly varying quality. Current work by the authors indicates that
this is indeed the case, but the issue needs to be examined in greater detail.

One final note that needs to be considered is the behavior of combiners for a large number of 
classifiers (N). Clearly, the errors cannot be arbitrarily reduced by increasing N indefinitely. This
observation however, does not contradict the results presented in this analysis. For large N, the
assumption that the errors were i.i.d. breaks down, reducing the improvements due to each extra 
classifier. The number of classifiers that yield the best results depends on a number of factors, 
including the number of feature sets extracted from the data, their dimensionality, and the selection 
of the network architectures. 